XFT: Practical Fault Tolerance beyond Crashes
Abstract
Despite 30+ years of intensive research, the distributed computing community still does not have a practical answer to non-crash faults of the machines that comprise a distributed system. In particular, Byzantine fault tolerance (BFT), which promises to handle such faults, has not lived up to expectations because of its resource and operation overhead relative to its crash fault-tolerant (CFT) counterparts. This overhead comes from the worst-case assumption about Byzantine faults, namely that some coordinated adversarial activity controls the faulty machines and the entire network at will. To practitioners, however, such strong attacks appear irrelevant.

In this paper, we introduce XFT ("cross fault tolerance"), a novel approach to building reliable distributed systems that decouples the fault space across the machine and network fault dimensions, treating machine faults and network asynchrony separately. This is in sharp contrast to the existing CFT and BFT models, which discern system faults only along the machine fault dimension. XFT offers much more flexibility than traditional synchronous and asynchronous models, which (too strictly) fix the network fault model of interest regardless of the machine faults.

As the showcase for XFT, we present Paxos++: the first state machine replication protocol in the XFT model. Paxos++ tolerates faults beyond crashes in an efficient and practical way, featuring many more nines of reliability than the celebrated crash-tolerant Paxos protocol, without impacting its resource/operation costs and while maintaining the same performance (common-case communication complexity among replicas). Surprisingly, Paxos++ sometimes (depending on the system environment) even offers strictly stronger reliability guarantees than state-of-the-art BFT replication protocols.
Similar resources
From Viewstamped Replication to Byzantine Fault Tolerance
The paper provides an historical perspective about two replication protocols, each of which was intended for practical deployment. The first is Viewstamped Replication, which was developed in the 1980’s and allows a group of replicas to continue to provide service in spite of a certain number of crashes among them. The second is an extension of Viewstamped Replication that allows the group to s...
Reliable Broadcast in a Computational Hybrid Model with Byzantine Faults, Crashes, and Recoveries
This paper presents a formal model for asynchronous distributed systems with parties that exhibit Byzantine faults or that crash and subsequently recover. Motivated by practical considerations, it represents an intermediate step between crash-recovery models for distributed computing and proactive security methods for tolerating arbitrary faults. The model is computational and based on complexi...
Comparison of Failure Detectors and Group Membership: Performance Study of Two Atomic Broadcast Algorithms
Protocols that solve agreement problems are essential building blocks for fault tolerant distributed systems. While many protocols have been published, little has been done to analyze their performance, especially the performance of their fault tolerance mechanisms. In this paper, we present a performance evaluation methodology that can be generalized to analyze many kinds of fault-tolerant alg...
Improved Fault Tolerant Elastic Scheduling Algorithm for Cloud Computing
The paper focuses on fault tolerance, a long-standing problem in cloud computing, by extending the Primary Backup model to include cloud features such as virtualization and elasticity. Fault tolerance is challenging in cloud computing because virtual machines, rather than hosts, are the basic computing instances, which enables virtual machines to migrate to other hosts. The on-demand provisioning of reso...
Measuring Fault Tolerance with the FTAPE Fault Injection Tool
This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The major parts of the tool include a system-wide fault injector, a workload generator, and a workload activity measurement tool. The workload creates high stress conditions on the machine. Using stress-based injection, the fault injector is able to utilize knowl...